Airbnb has not only changed how people travel and live, but has also created new business potential. We are interested in exploring the data generated by Airbnb and analyzing it to find interesting facts about Airbnb listings in New York. We hope to generate useful insights that offer guidance for customers and business suggestions for hosts.
Data source: http://insideairbnb.com/get-the-data.html.
The data source provides datasets of Airbnb listing information. We use the most recent (September 2019) dataset for New York. The data is not cleaned, so we need to spend some time tidying it.
The first obstacle is that the data file is relatively large: the downloaded .csv file is more than 180 MB. We therefore select only the columns relevant to our analysis. We also exclude listings without reviews, as we want to focus on active listings.
Another obstacle is that some information is compressed into one cell, such as amenities. The amenities column packs all amenities of a listing into one long string, and the items are not separated by a simple delimiter. We transformed amenities into a separate tidy data frame, and we will try to clean other similar columns.
Additionally, we define an active listing as one that received a review within the past 12 months. We also keep only listings with a price larger than zero, since a price of zero is probably due to an error in data collection.
For the cleaning fee (cleaning_fee) and security deposit (security_deposit), it is intuitive to replace missing values with 0. A small number of listings have missing values in other variables; we simply exclude them, since the gaps may also be due to collection errors in the original data.
Finally, there are 28,098 listings and 62 variables for our analysis. The variables include
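Amenities (kept for backup)
# Split amenities into a tidy data frame
library(tidyverse)  # stringr, dplyr, tidyr, readr
## Function to turn a raw amenities string into a "|"-delimited string
split_amenities <- function(x) {
  # x is one raw amenities string
  x <- str_replace_all(x, "[{}]", ",")   # replace the surrounding braces with commas
  temp <- str_split(x, "\\\"")[[1]]       # split on double quotes
  # Pieces that start or end with a comma are runs of unquoted items:
  # turn their commas into "|"
  unquoted <- str_starts(temp, "[,]") | str_ends(temp, "[,]")
  temp[unquoted] <- str_replace_all(temp[unquoted], "[,]", "|")
  temp <- str_remove(temp, "^\\|")        # drop a leading "|"
  temp <- str_remove(temp, "\\|$")        # drop a trailing "|"
  temp <- temp[temp != ""]                # drop empty pieces
  out <- paste(temp, collapse = "|")
  return(out)
}
# 38 listings have no amenities
dat2 <- dat1 %>%
  select(id, amenities) %>%
  rowwise() %>%
  mutate(amenity = split_amenities(amenities)) %>%
  ungroup()                               # leave row-wise mode before reshaping
dat3 <- dat2 %>% separate_rows(amenity, sep = "\\|")
## Output amenities
# write_csv(dat3 %>% select(-amenities), "./data/amenities_201909.csv")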
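As a cross-check on the helper above, the same tokenisation can be sketched with a single regular expression. This is a minimal alternative sketch, not the code used for the analysis; the sample string `raw` is a made-up example of the raw format (braces around the list, multi-word items wrapped in double quotes).

```r
library(stringr)

# Hypothetical sample in the raw amenities format (not taken from the data)
raw <- '{TV,"Cable TV",Wifi,"Air conditioning"}'

# Match either a quoted item or a run of characters with no comma, brace, or quote
items <- str_extract_all(raw, '"[^"]*"|[^,\\{\\}"]+')[[1]]

# Drop the surrounding double quotes from the quoted items
items <- str_remove_all(items, '"')

print(items)
```

Each element of `items` is one amenity, so the vector can be fed directly into a long-format table with one row per listing and amenity.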
The majority of listings in our data are in Manhattan and Brooklyn. The closer to downtown New York City, the denser the listings.
## Map of listings
library(ggmap)  # Google tiles need an API key registered with register_google()
data <- airbnb_cleaned
data_map_all <- data %>% dplyr::select(id, longitude, latitude)
# Bounding box around all listings, with a small margin
sbbox <- make_bbox(lon = data_map_all$longitude, lat = data_map_all$latitude, f = .001)
ny_map_all <- get_map(location = sbbox, maptype = "satellite", source = "google")
ggmap(ny_map_all) +
  geom_point(data = data_map_all, mapping = aes(x = longitude, y = latitude),
             color = "red", size = 0.0011, alpha = 0.6)
Booking price is an important factor that both customers and hosts care about. In this section, we explore Airbnb booking prices in the New York market.
The first plot shows the overall distribution of price. It is right-skewed with a very long tail, so we apply a log10 transformation for better visualization. The distribution shows that the median price is $100.
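The log10 view described above can be sketched as follows. This is only an illustration under stated assumptions: the prices are simulated from a log-normal distribution as a stand-in for the real price column, and the object names `prices` and `p_log` are hypothetical.

```r
library(ggplot2)

set.seed(1)
# Simulated right-skewed prices; a stand-in for the real price column (assumption)
prices <- data.frame(price = rlnorm(5000, meanlog = log(100), sdlog = 0.7))

p_log <- ggplot(prices, aes(x = price)) +
  geom_histogram(bins = 50, color = "white", fill = "#3CA9FC") +
  scale_x_log10() +  # log10 axis compresses the long right tail
  geom_vline(xintercept = median(prices$price), linetype = "dashed") +  # mark the median
  labs(x = "Price per night ($, log10 scale)", y = "Number of listings")

print(p_log)
```

Using `scale_x_log10()` keeps the axis labels in dollars, which is often easier to read than plotting `log10(price)` directly.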
The boxplot that follows also shows that most listings have a price under $200. Some listings have extremely high prices. What are these listings, and why are they so expensive? Let’s explore these high-priced listings further.
We focus on listings priced above $600. Most of them are in Manhattan, which is reasonable. Furthermore, most of the high-priced listings are in Midtown, which is also not unexpected, because Midtown is the central portion of Manhattan.
Moreover, most of these high-priced listings have the room type entire home/apt and about 2-4 bedrooms, which may explain their high prices. It is reasonable for a large apartment in Midtown Manhattan to be expensive.
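The breakdown of high-priced listings described above can be sketched with a filter followed by counts. The rows below are toy data, and the column names (`neighbourhood_group_cleansed`, `room_type`, `bedrooms`) are assumed to follow the Inside Airbnb listing schema; adjust them to the actual cleaned table.

```r
library(dplyr)

# Toy rows for illustration only; column names are assumptions
listings <- tibble(
  price = c(80, 650, 1200, 150, 700, 95),
  neighbourhood_group_cleansed = c("Brooklyn", "Manhattan", "Manhattan",
                                   "Queens", "Brooklyn", "Bronx"),
  room_type = c("Private room", "Entire home/apt", "Entire home/apt",
                "Private room", "Entire home/apt", "Shared room"),
  bedrooms = c(1, 3, 4, 1, 2, 1)
)

# Keep only listings priced above $600 per night
expensive <- listings %>% filter(price > 600)

# Where are they, and what kind of listings are they?
by_borough <- expensive %>% count(neighbourhood_group_cleansed, sort = TRUE)
by_room    <- expensive %>% count(room_type, bedrooms, sort = TRUE)

print(by_borough)
print(by_room)
```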
Based on the price distribution above, we are interested in how prices vary with location. The push-pin map below shows the price distribution across New York districts. Most listings cost between $100 and $250 per night, and the closer a listing is to downtown New York, the higher the price. The purple points indicate expensive listings above $250 per night. Listings that cost more than $500 per night are rare.
Indeed, price is largely affected by location. To illustrate this better, we plotted the average price per night for the five boroughs.
The Bronx, Staten Island, and Queens are almost the same. Brooklyn and Manhattan are more expensive by about $50 to $100 per night, which is a large amount considering that the majority of prices are around $100 per night.
The price distributions for the different boroughs share the same shape. They have roughly the same variance but different means, which agrees with our previous observation.
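The comparison of means and spreads across boroughs can be sketched with a grouped summary. The data below are toy values, and `borough` is a hypothetical column name standing in for whichever borough column the cleaned table uses.

```r
library(dplyr)

# Toy data for illustration; `borough` is a hypothetical column name
listings <- tibble(
  borough = c("Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Queens", "Queens"),
  price   = c(200, 180, 130, 110, 90, 100)
)

# Mean, standard deviation, and count of nightly prices per borough
price_by_borough <- listings %>%
  group_by(borough) %>%
  summarise(mean_price = mean(price), sd_price = sd(price), n = n()) %>%
  arrange(desc(mean_price))

print(price_by_borough)
```

Reporting the standard deviation next to the mean makes the "same variance, different mean" observation checkable at a glance.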
When browsing through listings, the price that the Airbnb site displays does not include the required deposit or the cleaning fee. We are interested in whether the deposit and cleaning fee are proportional to the listing price, as we intuitively expect them to be.
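# plot1 is not shown in the source; this version is an assumed analogue of plot2
# for cleaning_fee (the binwidth of 10 is a guess)
plot1 <- ggplot(data = airbnb_cleaned, aes(x = cleaning_fee)) +
  geom_histogram(aes(y = ..density..), binwidth = 10, color = 'white', fill = '#3CA9FC') +
  geom_density() +
  ggtitle("Distribution of cleaning fee($)")
plot2 <- ggplot(data = airbnb_cleaned, aes(x = security_deposit)) +
  geom_histogram(aes(y = ..density..), binwidth = 50, color = 'white', fill = '#3CA9FC') +
  geom_density() +
  ggtitle("Distribution of security deposit($)")
library(gridExtra)
grid.arrange(plot1, plot2, ncol = 2)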
Only a few listings have a cleaning fee of $0, and almost all cleaning fees are within $200, which is a fairly wide range considering that our median listing price per night is $100. Only 3% of the listings do not require a security deposit and, surprisingly, about 25% (about 7,000 of the 28,098 listings) have a security deposit of $500.
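A quick check of the share implied by the counts quoted above, using only the two figures reported in the text (the total after cleaning and the approximate number of listings with a $500 deposit):

```r
n_total   <- 28098  # listings after cleaning, as reported in the text
n_deposit <- 7000   # approximate count with a $500 deposit, as reported in the text

share_pct <- 100 * n_deposit / n_total
print(round(share_pct, 1))       # 24.9, i.e. roughly a quarter of all listings

# For comparison, 0.25% of the listings would be only about 70
print(round(0.0025 * n_total))   # 70
```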
We use scatter plots to explore the relationship between price and the cleaning fee and security deposit.
We did not observe any obvious pattern in price vs. cleaning fee or price vs. security deposit. The scatter plot of price vs. security deposit shows some horizontal lines: security deposits tend to be set to whole-number amounts, so the values are discrete and line up.
Room type is also an important factor for customers. There are four room types in the Airbnb market: entire home/apt, private room, shared room, and hotel room.
First, we look at the price distribution for each room type. Entire home/apt and hotel room have relatively higher prices than private room and shared room, which is reasonable. Shared rooms have the lowest prices in the market. These findings agree with common sense.
Surprisingly, in an area as expensive as Manhattan, many listings are “Entire Home/Apt”. Manhattan does have the most expensive listings, but they are also of relatively good quality (rather than being all shared rooms and small private rooms).
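The room-type mix within each borough can be sketched as a count followed by a within-group share. The rows are toy data and `borough` is a hypothetical column name; `room_type` is assumed to match the listing schema.

```r
library(dplyr)

# Toy data for illustration; column names are assumptions
listings <- tibble(
  borough   = c("Manhattan", "Manhattan", "Manhattan",
                "Brooklyn", "Brooklyn", "Brooklyn", "Brooklyn"),
  room_type = c("Entire home/apt", "Entire home/apt", "Private room",
                "Entire home/apt", "Private room", "Private room", "Shared room")
)

# Count listings per borough and room type, then convert to a share within each borough
room_share <- listings %>%
  count(borough, room_type) %>%
  group_by(borough) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

print(room_share)
```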
Other variables may also be important factors that affect price, such as the number of reviews, minimum number of nights, host response rate, and review scores rating. However, the pairwise scatter plot shows that price does not have a strong correlation with any of these variables. Prices are set by hosts, and not every host has much financial and marketing information. Additionally, not every host’s purpose is to make money, so prices may not reflect actual market values. Price setting is individual and subjective. Therefore, a high price does not guarantee good service.
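A correlation matrix can serve as a numerical companion to the pairwise scatter plot. The sketch below uses simulated, independent columns as stand-ins for the real data, so its numbers say nothing about the actual listings; the column names are assumed to match the cleaned table, and Spearman correlation is a choice made here, not something stated in the analysis.

```r
set.seed(42)
n <- 500

# Simulated stand-ins for the real columns (assumption); names mirror the listing schema
listings <- data.frame(
  price                = rlnorm(n, log(100), 0.7),
  number_of_reviews    = rpois(n, 20),
  minimum_nights       = sample(1:30, n, replace = TRUE),
  host_response_rate   = runif(n, 50, 100),
  review_scores_rating = pmin(100, rnorm(n, 93, 5))
)

# Spearman rank correlation is less sensitive to the heavy right tail of price
cor_mat <- cor(listings, method = "spearman", use = "pairwise.complete.obs")

# Correlation of price with every variable
print(round(cor_mat["price", ], 2))
```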
Analyzing reviews reveals how customers value the quality of Airbnb services. Customers can rate an Airbnb stay on six aspects: cleanliness, accuracy, check-in, communication, location, and value, where value represents the overall feeling of the stay. It is useful to explore which aspect affects the “value rating” the most. Knowing what customers care about can help hosts understand how to provide better service and earn better ratings.
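One way to explore which aspect is most associated with the value rating is a linear regression of the value score on the other five sub-scores. This is a minimal sketch under stated assumptions, not the method used in the analysis: the scores are simulated, and the weights in the generating formula are arbitrary values chosen only so the example has a known answer.

```r
set.seed(7)
n <- 1000

# Simulated sub-scores on a 6-10 scale; column names mirror the review_scores_* columns
scores <- data.frame(
  review_scores_cleanliness   = sample(6:10, n, replace = TRUE),
  review_scores_accuracy      = sample(6:10, n, replace = TRUE),
  review_scores_checkin       = sample(6:10, n, replace = TRUE),
  review_scores_communication = sample(6:10, n, replace = TRUE),
  review_scores_location      = sample(6:10, n, replace = TRUE)
)

# Hypothetical generating process (arbitrary weights, for illustration only)
scores$review_scores_value <- with(scores,
  0.4 * review_scores_cleanliness + 0.3 * review_scores_accuracy +
  0.1 * review_scores_checkin + 0.1 * review_scores_communication +
  0.1 * review_scores_location + rnorm(n, sd = 0.5))

# Regress the value score on the other five sub-scores
fit <- lm(review_scores_value ~ ., data = scores)

# All predictors share one scale, so slopes can be compared directly
coefs <- sort(coef(fit)[-1], decreasing = TRUE)
print(round(coefs, 2))
```

On real data the sub-scores are likely correlated with one another, so the coefficients should be read as associations rather than causal effects.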
# Data importing and cleaning
library(tidyverse)  # readr, dplyr, stringr
data <- read_csv("listings_201909.csv", na = c("", "NA", "N/A"))
## Select the columns relevant to the analysis
dat <- data %>%
  select(-c(scrape_id:xl_picture_url)) %>%
  select(-host_url, -host_name, -host_location, -host_about,
         -host_acceptance_rate, -host_thumbnail_url, -host_picture_url,
         -host_neighbourhood) %>%
  select(-street, -city, -state, -market, -smart_location, -country,
         -country_code) %>%
  select(-jurisdiction_names, -license, -weekly_price, -monthly_price,
         -square_feet) %>%
  select(-c(calendar_updated:calendar_last_scraped))
# dim(dat)  # 48377 62
## Modify data types: strip "%" and "$", then convert to numeric
dat <- dat %>%
  mutate_at(c("host_response_rate", "extra_people",
              "price", "security_deposit", "cleaning_fee"),
            str_remove_all, pattern = "[%$]") %>%
  mutate_at(c("host_response_rate", "extra_people",
              "price", "security_deposit", "cleaning_fee"),
            as.numeric)
## Keep listings that have reviews within the last 12 months
data_cleaned <- dat %>%
  filter(!is.na(first_review)) %>%
  filter(last_review >= '2018-09-01')
# dim(data_cleaned)  # 28105 62
# Clean NA
## Replace missing fees and deposits with 0
data_cleaned$cleaning_fee[is.na(data_cleaned$cleaning_fee)] <- 0
data_cleaned$security_deposit[is.na(data_cleaned$security_deposit)] <- 0
## Keep only listings with a positive price
data_cleaned <- data_cleaned %>% filter(price > 0)
## Exclude rows with missing values in the listed columns
completeFun <- function(data, desiredCols) {
  completeVec <- complete.cases(data[, desiredCols])
  return(data[completeVec, ])
}
data_cleaned <- completeFun(data_cleaned, c("review_scores_value", "review_scores_checkin", "review_scores_accuracy", "review_scores_communication", "review_scores_cleanliness", "review_scores_rating", "neighbourhood", "review_scores_location", "price", "bedrooms", "beds", "bathrooms", "host_identity_verified", "zipcode"))
# sort(colSums(is.na(data_cleaned)), decreasing = TRUE)
dim(data_cleaned)  # 28098
# Cleaned data output
## write_csv(data_cleaned, "data_cleaned.csv")
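One point worth checking in the type-conversion step above: the pattern `"[%$]"` does not remove thousands separators. The sketch below uses two made-up raw values to show the effect; whether the raw file actually contains values such as "$1,200.00" is an assumption to verify. If it does, those prices become `NA` and the listings are dropped by the later `price > 0` filter.

```r
library(readr)
library(stringr)

# Hypothetical raw price strings (not taken from the data)
raw_price <- c("$85.00", "$1,200.00")

# Stripping only "%" and "$" leaves the comma, so the second value fails to convert
stripped <- suppressWarnings(as.numeric(str_remove_all(raw_price, "[%$]")))
print(stripped)

# Removing the comma as well keeps both values
fixed <- as.numeric(str_remove_all(raw_price, "[%$,]"))
print(fixed)

# readr::parse_number() drops currency symbols and grouping marks in one step
parsed <- parse_number(raw_price)
print(parsed)
```

Applying either fix to the real data would change the final row count, so the figures quoted in the text would need to be recomputed.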